Goto

Collaborating Authors

 regression 0


Dimensionality Reduction on IoT Monitoring Data of Smart Building for Energy Consumption Forecasting

Koutras, Konstantinos, Bompotas, Agorakis, Halkiopoulos, Constantinos, Kalogeras, Athanasios, Alexakos, Christos

arXiv.org Artificial Intelligence

The Internet of Things (IoT) plays a major role today in smart building infrastructures, from simple smart-home applications, to more sophisticated industrial type installations. The vast amounts of data generated from relevant systems can be processed in different ways revealing important information. This is especially true in the era of edge computing, when advanced data analysis and decision-making is gradually moving to the edge of the network where devices are generally characterised by low computing resources. In this context, one of the emerging main challenges is related to maintaining data analysis accuracy even with less data that can be efficiently handled by low resource devices. The present work focuses on correlation analysis of data retrieved from a pilot IoT network installation monitoring a small smart office by means of environmental and energy consumption sensors. The research motivation was to find statistical correlation between the monitoring variables that will allow the use of machine learning (ML) prediction algorithms for energy consumption reducing input parameters. For this to happen, a series of hypothesis tests for the correlation of three different environmental variables with the energy consumption were carried out. A total of ninety tests were performed, thirty for each pair of variables. In these tests, p-values showed the existence of strong or semi-strong correlation with two environmental variables, and of a weak correlation with a third one. Using the proposed methodology, we manage without examining the entire data set to exclude weak correlated variables while keeping the same score of accuracy.


Multilingual Hope Speech Detection: A Comparative Study of Logistic Regression, mBERT, and XLM-RoBERTa with Active Learning

Abiola, T. O., Abiodun, K. D., Olumide, O. E., Adebanji, O. O., Calvo, O. Hiram, Sidorov, Grigori

arXiv.org Artificial Intelligence

Hope speech language that fosters encouragement and optimism plays a vital role in promoting positive discourse online. However, its detection remains challenging, especially in multilingual and low-resource settings. This paper presents a multilingual framework for hope speech detection using an active learning approach and transformer-based models, including mBERT and XLM-RoBERTa. Experiments were conducted on datasets in English, Spanish, German, and Urdu, including benchmark test sets from recent shared tasks. Our results show that transformer models significantly outperform traditional baselines, with XLM-RoBERTa achieving the highest overall accuracy. Furthermore, our active learning strategy maintained strong performance even with small annotated datasets. This study highlights the effectiveness of combining multilingual transformers with data-efficient training strategies for hope speech detection.


Aleks: AI powered Multi Agent System for Autonomous Scientific Discovery via Data-Driven Approaches in Plant Science

Jin, Daoyuan, Gunner, Nick, Janke, Niko Carvajal, Baruah, Shivranjani, Gold, Kaitlin M., Jiang, Yu

arXiv.org Artificial Intelligence

Modern plant science increasingly relies on large, heterogeneous datasets, but challenges in experimental design, data preprocessing, and reproducibility hinder research throughput. Here we introduce Aleks, an AI-powered multi-agent system that integrates domain knowledge, data analysis, and machine learning within a structured framework to autonomously conduct data-driven scientific discovery. Once provided with a research question and dataset, Aleks iteratively formulated problems, explored alternative modeling strategies, and refined solutions across multiple cycles without human intervention. In a case study on grapevine red blotch disease, Aleks progressively identified biologically meaningful features and converged on interpretable models with robust performance. Ablation studies underscored the importance of domain knowledge and memory for coherent outcomes. This exploratory work highlights the promise of agentic AI as an autonomous collaborator for accelerating scientific discovery in plant sciences.


Predicting Fetal Birthweight from High Dimensional Data using Advanced Machine Learning

Kapure, Nachiket, Joshi, Harsh, Mistri, Rajeshwari, Kumari, Parul, Mali, Manasi, Purohit, Seema, Sharma, Neha, Panday, Mrityunjoy, Yajnik, Chittaranjan S.

arXiv.org Artificial Intelligence

Birth weight serves as a fundamental indicator of neonatal health, closely linked to both early medical interventions and long-term developmental risks. Traditional predictive models, often constrained by limited feature selection and incomplete datasets, struggle to achieve overlooking complex maternal and fetal interactions in diverse clinical settings. This research explores machine learning to address these limitations, utilizing a structured methodology that integrates advanced imputation strategies, supervised feature selection techniques, and predictive modeling. Given the constraints of the dataset, the research strengthens the role of data preprocessing in improving the model performance. Among the various methodologies explored, tree-based feature selection methods demonstrated superior capability in identifying the most relevant predictors, while ensemble-based regression models proved highly effective in capturing non-linear relationships and complex maternal-fetal interactions within the data. Beyond model performance, the study highlights the clinical significance of key physiological determinants, offering insights into maternal and fetal health factors that influence birth weight, offering insights that extend over statistical modeling. By bridging computational intelligence with perinatal research, this work underscores the transformative role of machine learning in enhancing predictive accuracy, refining risk assessment and informing data-driven decision-making in maternal and neonatal care. Keywords: Birth weight prediction, maternal-fetal health, MICE, BART, Gradient Boosting, neonatal outcomes, Clinipredictive.


Computing Gram Matrix for SMILES Strings using RDKFingerprint and Sinkhorn-Knopp Algorithm

Ali, Sarwan, Mansoor, Haris, Chourasia, Prakash, Khan, Imdad Ullah, Patterson, Murray

arXiv.org Artificial Intelligence

In molecular structure data, SMILES (Simplified Molecular Input Line Entry System) strings are used to analyze molecular structure design. Numerical feature representation of SMILES strings is a challenging task. This work proposes a kernel-based approach for encoding and analyzing molecular structures from SMILES strings. The proposed approach involves computing a kernel matrix using the Sinkhorn-Knopp algorithm while using kernel principal component analysis (PCA) for dimensionality reduction. The resulting low-dimensional embeddings are then used for classification and regression analysis. The kernel matrix is computed by converting the SMILES strings into molecular structures using the Morgan Fingerprint, which computes a fingerprint for each molecule. The distance matrix is computed using the pairwise kernels function. The Sinkhorn-Knopp algorithm is used to compute the final kernel matrix that satisfies the constraints of a probability distribution. This is achieved by iteratively adjusting the kernel matrix until the marginal distributions of the rows and columns match the desired marginal distributions. We provided a comprehensive empirical analysis of the proposed kernel method to evaluate its goodness with greater depth. The suggested method is assessed for drug subcategory prediction (classification task) and solubility AlogPS ``Aqueous solubility and Octanol/Water partition coefficient" (regression task) using the benchmark SMILES string dataset. The outcomes show the proposed method outperforms several baseline methods in terms of supervised analysis and has potential uses in molecular design and drug discovery. Overall, the suggested method is a promising avenue for kernel methods-based molecular structure analysis and design.


Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models

Nezhad, Sina Bagheri, Agrawal, Ameeta, Pokharel, Rhitabrat

arXiv.org Artificial Intelligence

Multilingual language models (MLLMs) are crucial for handling text across various languages, yet they often show performance disparities due to differences in resource availability and linguistic characteristics. While the impact of pre-train data percentage and model size on performance is well-known, our study reveals additional critical factors that significantly influence MLLM effectiveness. Analyzing a wide range of features, including geographical, linguistic, and resource-related aspects, we focus on the SIB-200 dataset for classification and the Flores-200 dataset for machine translation, using regression models and SHAP values across 204 languages. Our findings identify token similarity and country similarity as pivotal factors, alongside pre-train data and model size, in enhancing model performance. Token similarity facilitates cross-lingual transfer, while country similarity highlights the importance of shared cultural and linguistic contexts. These insights offer valuable guidance for developing more equitable and effective multilingual language models, particularly for underrepresented languages.


A Comparative Study of Hyperparameter Tuning Methods

Dasgupta, Subhasis, Sen, Jaydip

arXiv.org Artificial Intelligence

The study emphasizes the challenge of finding the optimal trade-off between bias and variance, especially as hyperparameter optimization increases in complexity. Through empirical analysis, three hyperparameter tuning algorithms Tree-structured Parzen Estimator (TPE), Genetic Search, and Random Search are evaluated across regression and classification tasks. The results show that nonlinear models, with properly tuned hyperparameters, significantly outperform linear models. Interestingly, Random Search excelled in regression tasks, while TPE was more effective for classification tasks. This suggests that there is no one-size-fits-all solution, as different algorithms perform better depending on the task and model type. The findings underscore the importance of selecting the appropriate tuning method and highlight the computational challenges involved in optimizing machine learning models, particularly as search spaces expand.


SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models

Yin, Chun, Chi, Tai-Shih, Tsao, Yu, Wang, Hsin-Min

arXiv.org Artificial Intelligence

Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations shows significant improvements compared to baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability.


PRIME: A CyberGIS Platform for Resilience Inference Measurement and Enhancement

Mandal, Debayan, Zou, Dr. Lei, Wilkho, Rohan Singh, Abedin, Joynal, Zhou, Bing, Cai, Dr. Heng, Baig, Dr. Furqan, Gharaibeh, Dr. Nasir, Lam, Dr. Nina

arXiv.org Artificial Intelligence

In an era of increased climatic disasters, there is an urgent need to develop reliable frameworks and tools for evaluating and improving community resilience to climatic hazards at multiple geographical and temporal scales. Defining and quantifying resilience in the social domain is relatively subjective due to the intricate interplay of socioeconomic factors with disaster resilience. Meanwhile, there is a lack of computationally rigorous, user-friendly tools that can support customized resilience assessment considering local conditions. This study aims to address these gaps through the power of CyberGIS with three objectives: 1) To develop an empirically validated disaster resilience model - Customized Resilience Inference Measurement designed for multi-scale community resilience assessment and influential socioeconomic factors identification, 2) To implement a Platform for Resilience Inference Measurement and Enhancement module in the CyberGISX platform backed by high-performance computing, 3) To demonstrate the utility of PRIME through a representative study. CRIM generates vulnerability, adaptability, and overall resilience scores derived from empirical hazard parameters. Computationally intensive Machine Learning methods are employed to explain the intricate relationships between these scores and socioeconomic driving factors. PRIME provides a web-based notebook interface guiding users to select study areas, configure parameters, calculate and geo-visualize resilience scores, and interpret socioeconomic factors shaping resilience capacities. A representative study showcases the efficiency of the platform while explaining how the visual results obtained may be interpreted. The essence of this work lies in its comprehensive architecture that encapsulates the requisite data, analytical and geo-visualization functions, and ML models for resilience assessment.


Modeling Large-Scale Walking and Cycling Networks: A Machine Learning Approach Using Mobile Phone and Crowdsourced Data

Saberi, Meead, Lilasathapornkit, Tanapon

arXiv.org Artificial Intelligence

Walking and cycling are known to bring substantial health, environmental, and economic advantages. However, the development of evidence-based active transportation planning and policies has been impeded by significant data limitations, such as biases in crowdsourced data and representativeness issues of mobile phone data. In this study, we develop and apply a machine learning based modeling approach for estimating daily walking and cycling volumes across a large-scale regional network in New South Wales, Australia that includes 188,999 walking links and 114,885 cycling links. The modeling methodology leverages crowdsourced and mobile phone data as well as a range of other datasets on population, land use, topography, climate, etc. The study discusses the unique challenges and limitations related to all three aspects of model training, testing, and inference given the large geographical extent of the modeled networks and relative scarcity of observed walking and cycling count data. The study also proposes a new technique to identify model estimate outliers and to mitigate their impact. Overall, the study provides a valuable resource for transportation modelers, policymakers and urban planners seeking to enhance active transportation infrastructure planning and policies with advanced emerging data-driven modeling methodologies.